selective_mixed_precision: QKV-aware overrides, AUTO memory mode, MULTI_GPU dispatch #2473

Open
hanbitmyths wants to merge 4 commits into microsoft:main from hanbitmyths:smp-qkv-aware-multi-gpu

@hanbitmyths hanbitmyths commented May 22, 2026

This PR hardens SelectiveMixedPrecision (SMP) for real-world LLMs targeting ONNX Runtime GenAI:

  1. QKV-aware quant config overrides (olive/passes/pytorch/quant_utils.py): Normalize the per-layer override dict so that the Q, K, and V projections in the same attention block always share precision. ModelBuilder's GQA fusion requires this; without it, partial overrides silently break export on Qwen-style models.

  2. AUTO kld_memory_mode (olive/passes/pytorch/selective_mixed_precision.py): A new auto setting selects among full, multi_gpu, low_memory, and offload based on visible GPU memory and estimated model footprint, and logs the decision (e.g. KLD memory mode auto-selected: multi_gpu (gpus=3, full=145.14GB, multi_budget=215.86GB, ...)).

  3. New multi_gpu mode: Uses accelerate.dispatch_model + infer_auto_device_map with _no_split_modules honored. After infer_auto_device_map, every model.layers.N.* entry is coalesced to the first device assigned for that layer, and a defensive check falls back to low_memory if a decoder layer still spans devices. A diagnostic info log reports the per-device layer counts.
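
The Q/K/V normalization in item 1 can be illustrated with a minimal sketch. It is not the code in quant_utils.py: the function name `normalize_qkv_overrides`, the `q_proj`/`k_proj`/`v_proj` suffixes, and the `{"bits": ...}` config schema are assumptions made for illustration, and "share precision" is interpreted here as "promote all three to the highest-bit config found in the block".

```python
import re

# Assumed projection names; real architectures may name these differently.
QKV_SUFFIXES = ("q_proj", "k_proj", "v_proj")
_QKV_RE = re.compile(r"^(?P<prefix>.+)\.(?P<proj>q_proj|k_proj|v_proj)$")


def normalize_qkv_overrides(overrides):
    """Return a copy of `overrides` where Q, K and V of each attention block share one config.

    `overrides` maps module names to config dicts with a "bits" key (assumed schema).
    The highest-bit config found in a block is applied to all three projections.
    """
    normalized = dict(overrides)
    # Group the overridden projections by their attention-block prefix.
    groups = {}
    for name, cfg in overrides.items():
        match = _QKV_RE.match(name)
        if match:
            groups.setdefault(match.group("prefix"), []).append(cfg)
    # Fill in missing partners and align all three projections per block.
    for prefix, cfgs in groups.items():
        best = max(cfgs, key=lambda c: c["bits"])
        for proj in QKV_SUFFIXES:
            normalized[f"{prefix}.{proj}"] = dict(best)
    return normalized


# A partial override: only q_proj of layer 0 was promoted.
overrides = {
    "model.layers.0.self_attn.q_proj": {"bits": 8},
    "model.layers.0.mlp.down_proj": {"bits": 8},
}
fixed = normalize_qkv_overrides(overrides)
```

In this sketch the partial override on `q_proj` gains matching `k_proj` and `v_proj` entries, while the unrelated `down_proj` entry is left untouched.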

Validation (A100 VM)

  • Qwen3-0.6B old vs new export: tokens identical (124 vs 116 overrides, new_missing_qkv_partners=[]), same 657 MB output, ~301 vs 309 tok/s.
  • Qwen2.5-1.5B-Instruct export + ort-genai: 1.34 GB int4, 290 tok/s.
  • Qwen2.5-14B-Instruct AUTO → MULTI_GPU (3×A100), 9.44 GB int4, 95 tok/s.

MMLU 0-shot (HF fp16 vs ort-genai int4, greedy)

| Model | N | PyTorch (HF fp16) | ort-genai (int4) | Δ |
| --- | --- | --- | --- | --- |
| Qwen3-0.6B | 500 | 36.6% | 28.6% | −8.0 pp |
| Qwen2.5-1.5B-Instruct | 500 | 60.2% | 54.2% | −6.0 pp |
| Qwen2.5-14B-Instruct | 250 | 74.8% | 77.2% | +2.4 pp (within ±5.5 pp CI) |

The 14B delta falls within the confidence interval, so no accuracy loss is measurable at this sample size; the small-model deltas are inherent to int4 SMP on sub-2B-parameter models, not regressions introduced here.
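
For readers who want to sanity-check the interval quoted for the 14B row: the PR does not say how the ±5.5 pp figure was computed. The sketch below assumes a normal-approximation 95% binomial interval on a single accuracy estimate, which gives a half-width of roughly 5.4 pp at N=250. The function name is hypothetical.

```python
import math


def ci_half_width_pp(accuracy, n, z=1.96):
    """Half-width of the normal-approximation interval, in percentage points.

    `accuracy` is a fraction in [0, 1]; `n` is the number of evaluated questions.
    z=1.96 corresponds to a 95% interval (an assumption, not stated in the PR).
    """
    return 100.0 * z * math.sqrt(accuracy * (1.0 - accuracy) / n)


# Qwen2.5-14B-Instruct, PyTorch baseline: 74.8% on 250 questions.
half_14b = ci_half_width_pp(0.748, 250)
# Qwen3-0.6B, PyTorch baseline: 36.6% on 500 questions.
half_06b = ci_half_width_pp(0.366, 500)
```

Note that this is the interval on one accuracy number; an interval on the difference between two runs would be wider unless the comparison is paired per question.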

Checklist before requesting a review

  • Add unit tests for this change.
  • Make sure all tests can pass. (24 passed, 1 skipped in test_selective_mixed_precision.py)
  • Update documents if necessary.
  • Lint and apply fixes to your code by running lintrunner -a
  • Is this a user-facing change? If yes, give a description of this change to be included in the release notes.

Release note: SelectiveMixedPrecision now supports an auto setting for kld_memory_mode and a new multi_gpu mode that shards the KLD-scored forward across visible GPUs via Accelerate. Quant config overrides are normalized so Q/K/V projections in the same attention block share precision, ensuring compatibility with ModelBuilder GQA fusion.
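
As a rough illustration of how an auto setting could choose among the four modes, here is a minimal sketch. The decision order, the function name `select_kld_memory_mode`, and its parameters are assumptions; the actual thresholds and estimates live in selective_mixed_precision.py and are not reproduced here. The example numbers mirror the logged decision quoted in the description (3 GPUs, full=145.14GB, multi_budget=215.86GB), with the multi-GPU budget assumed to be split evenly across devices.

```python
def select_kld_memory_mode(full_gb, low_memory_gb, gpu_budgets_gb):
    """Pick a memory mode from estimated footprints and per-GPU budgets (all in GB).

    full_gb:        estimated footprint of scoring with the whole model on GPU
    low_memory_gb:  estimated footprint of the reduced-memory scoring path
    gpu_budgets_gb: usable memory per visible GPU (empty if no GPU is visible)
    """
    if not gpu_budgets_gb:
        return "offload"
    largest = max(gpu_budgets_gb)
    # Cheapest path first: everything fits on one device.
    if full_gb <= largest:
        return "full"
    # Otherwise shard across devices if their combined budget is enough.
    if len(gpu_budgets_gb) > 1 and full_gb <= sum(gpu_budgets_gb):
        return "multi_gpu"
    # Fall back to the reduced-memory path on a single device.
    if low_memory_gb <= largest:
        return "low_memory"
    return "offload"


# Three GPUs with roughly 71.95 GB of budget each (about 215.86 GB in total).
mode = select_kld_memory_mode(145.14, 40.0, [71.95, 71.95, 71.95])
```

Ordering the checks from fastest to slowest mode keeps small models on the single-GPU path even when several GPUs are visible.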

hanbitmyths and others added 3 commits May 22, 2026 16:44
…TI_GPU dispatch

- Normalize per-layer quant config overrides so Q/K/V projections in the same
  attention block share precision, required by ModelBuilder for GQA fusion.
- Add AUTO setting for kld_memory_mode that picks among FULL, MULTI_GPU,
  LOW_MEMORY, OFFLOAD based on available GPU memory and model size.
- Add MULTI_GPU mode that uses Accelerate's dispatch_model with
  _no_split_modules honored, plus a coalescing pass that pins every
  model.layers.N.* entry to a single device and falls back to LOW_MEMORY if a
  decoder layer still spans devices.
- Tests: 24 unit tests covering QKV grouping, AUTO selection thresholds, and
  the MULTI_GPU device-map coalescing path.
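
The coalescing pass and the defensive check described above can be sketched on a plain device map, a dict of module name to device of the kind infer_auto_device_map returns. The helper names below are hypothetical and the sketch does not call Accelerate; it only shows the dictionary manipulation, assuming "first device" means the first one encountered in the map's iteration order.

```python
import re

# Matches "model.layers.N" and any submodule under it.
_LAYER_RE = re.compile(r"^(model\.layers\.\d+)(?:\.|$)")


def coalesce_layer_devices(device_map):
    """Pin every model.layers.N.* entry to the first device seen for layer N."""
    first_device = {}
    coalesced = {}
    for name, device in device_map.items():
        match = _LAYER_RE.match(name)
        if match:
            # setdefault records the first device and returns it for later entries.
            device = first_device.setdefault(match.group(1), device)
        coalesced[name] = device
    return coalesced


def layers_spanning_devices(device_map):
    """Return the decoder layers whose entries are mapped to more than one device."""
    seen = {}
    for name, device in device_map.items():
        match = _LAYER_RE.match(name)
        if match:
            seen.setdefault(match.group(1), set()).add(device)
    return sorted(layer for layer, devices in seen.items() if len(devices) > 1)


# Layer 1 was split between GPU 0 and GPU 1 by the initial placement.
device_map = {
    "model.embed_tokens": 0,
    "model.layers.0": 0,
    "model.layers.1.self_attn": 0,
    "model.layers.1.mlp": 1,
    "model.layers.2": 1,
    "lm_head": 1,
}
coalesced = coalesce_layer_devices(device_map)
```

A caller could then treat a non-empty result from `layers_spanning_devices(coalesced)` as the signal to fall back to the low-memory path.
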
@hanbitmyths hanbitmyths marked this pull request as ready for review May 22, 2026 23:51
Copilot AI review requested due to automatic review settings May 22, 2026 23:51

Copilot AI left a comment

Pull request overview

This PR strengthens the SelectiveMixedPrecision (SMP) PyTorch pass for LLMs targeting ONNX Runtime GenAI by (a) enforcing Q/K/V consistency in both scored selection and quantization overrides, and (b) adding auto and multi_gpu memory modes for KLD-gradient scoring, which make scoring practical on large models.

Changes:

  • Add Q/K/V-aware grouping so scored selection promotes attention input projections together, and normalize quantization overrides so Q/K/V share the most-precise config.
  • Introduce kld_memory_mode with auto resolution plus a new multi_gpu mode using Accelerate dispatch and device-map coalescing/validation.
  • Expand unit tests to cover QKV grouping/normalization, KLD scoring equivalence across memory modes, and AUTO/MULTI_GPU selection behavior.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| olive/passes/pytorch/selective_mixed_precision.py | Adds QKV grouping in scored overrides and implements AUTO/FULL/MULTI_GPU/LOW_MEMORY/OFFLOAD KLD scoring paths with heuristics and Accelerate-based sharding. |
| olive/passes/pytorch/quant_utils.py | Adds QKV group discovery + override normalization to ensure attention input projections share a consistent quant config, including support for excluded attention inputs. |
| test/passes/pytorch/test_selective_mixed_precision.py | Adds extensive unit tests for QKV grouping/normalization and KLD scoring/memory-mode behavior, including MULTI_GPU dispatch stubbing. |

Comment thread olive/passes/pytorch/quant_utils.py
Comment thread olive/passes/pytorch/selective_mixed_precision.py Outdated